
Add transaction logs and basic conflict resolution #403

Merged: 10 commits into main, Nov 24, 2024

Conversation

paraseba (Contributor)

No description provided.

@paraseba force-pushed the push-kmsppxuloqls branch 4 times, most recently from ef8ebfb to cb287a1, on November 23, 2024 at 03:42
@rabernat (Contributor) left a comment:

Seba, this looks amazing. I gave this what I would call a "shallow" review. Overall the approach makes sense, the data structures look right, the tests are legible, etc.

The following observation is in no way a blocker for moving forward with this PR...

I wonder about the overall design of the transaction logs. Right now the logs are useful for detecting conflicts only, but they don't have the full information from the changeset. I'm curious whether it would make sense to just serialize the full changeset as the transaction log. This is what lance does: https://lancedb.github.io/lance/format.html

The advantage of this is that you could potentially recover an entire session if a commit fails. (Currently this information just lives in memory.) The downside of course is that the changeset is much larger and more expensive to store / parse.

Comment on lines +121 to +122
#[allow(clippy::panic)]
VersionSelection::Fail => panic!("Bug in conflict resolution: ChunkDoubleUpdate flagged as unrecoverable")
Contributor:

Are you saying we will never actually hit this code path in normal usage?

Contributor Author (paraseba):

Yes, this possibility is filtered out above, but the compiler doesn't know it.
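A minimal, self-contained sketch of the pattern under discussion, using a hypothetical Selection enum rather than icechunk's real VersionSelection: an earlier pass is expected to filter out the Fail variant, but the compiler can't prove that, so the arm panics and the clippy lint is silenced.

// Hypothetical stand-in for the real VersionSelection; illustration only.
enum Selection {
    UseOurs,
    UseTheirs,
    Fail,
}

// Callers are expected to have handled `Fail` before reaching this function,
// so the last arm should be unreachable in practice.
fn apply(selection: &Selection) -> &'static str {
    match selection {
        Selection::UseOurs => "keep our version of the chunk",
        Selection::UseTheirs => "keep the incoming version of the chunk",
        #[allow(clippy::panic)]
        Selection::Fail => panic!("bug in conflict resolution: Fail should have been filtered out"),
    }
}

fn main() {
    assert_eq!(apply(&Selection::UseOurs), "keep our version of the chunk");
}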

Comment on lines +763 to +764
for snap_id in new_commits.into_iter().rev() {
let tx_log = self.storage.fetch_transaction_log(&snap_id).await?;
Contributor:

One thought about this loop which is probably a very premature optimization...

Rather than fetching each commit one at a time and computing the conflicts one at a time, couldn't we fetch all of the transaction logs (in one async gather), merge them together, and compute conflicts once?

This optimization could be important if there are lots of commits.
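A rough sketch of that suggestion, with hypothetical stand-in types instead of icechunk's real Storage API and assuming the futures crate is available, so this shows the shape of the change rather than the final code: gather all transaction logs concurrently, then merge or inspect them in one pass.

use futures::future::try_join_all;

// Hypothetical stand-ins for the real types; illustration only.
struct SnapshotId(String);
struct TransactionLog;

async fn fetch_transaction_log(_id: &SnapshotId) -> std::io::Result<TransactionLog> {
    // In the real code this would read from object storage.
    Ok(TransactionLog)
}

// One concurrent gather instead of one round trip per commit.
async fn fetch_all_logs(new_commits: &[SnapshotId]) -> std::io::Result<Vec<TransactionLog>> {
    try_join_all(new_commits.iter().rev().map(fetch_transaction_log)).await
}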

Contributor Author (paraseba):

Great question. That's what I had initially, and it's what sounded more intuitive to me too. Here is why I gave up on it for this first pass; we can always go back eventually:

  • The logic of merging the transaction logs is not trivial; in fact, it looks a lot like the conflict resolution logic itself. Example: one commit deletes an array, the next commit creates a group on the same path and edits its attributes, and so on. It gets very messy, and it will get much worse once we introduce rename.
  • In the current design I was careful to ensure that if you are able to rebase only half of the commits, you get back a repo rebased to that point; you can then change your policy (or make some manual changes) and continue with another rebase. If we merge all the tx logs up front, it's all or nothing.
  • Memory usage is much larger if you have to merge many tx logs.
  • Error reporting becomes intractable under tx log merge.

I should have written a design document with all these details, but I was too bored of this problem already, hehe.

As I mentioned somewhere, I don't feel I got to the best design for this problem. I'm pretty sure we'll have to revisit it soon.

@@ -725,6 +734,63 @@ impl Repository {
}
}

pub async fn rebase(
Contributor:

A comment about what exactly this important function does would be very helpful.

My interpretation is that, if successful, it mutates the state of the repo such that it can commit successfully to the specified branch.

Contributor Author (paraseba):

Yes, your interpretation is correct. I'll document it and include the tricky case of "partial success", which is important.
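To make the "partial success" case concrete, here is a toy model of the behaviour described above (hypothetical names and types, not icechunk's API): commits are rebased one at a time, and an unsolvable conflict stops the loop without undoing the commits that were already applied.

// Toy model of partial success during rebase; illustration only.
struct Session {
    // Remote commits this session has already been rebased on top of.
    rebased_onto: Vec<String>,
}

#[derive(Debug)]
enum RebaseError {
    UnsolvableConflict { commit: String },
}

fn rebase(
    session: &mut Session,
    new_commits: &[String],
    can_solve: impl Fn(&str) -> bool,
) -> Result<(), RebaseError> {
    for commit in new_commits {
        if !can_solve(commit.as_str()) {
            // Stop here: everything rebased so far stays applied, so the caller can
            // switch to a different solver (or fix things by hand) and rebase again.
            return Err(RebaseError::UnsolvableConflict { commit: commit.clone() });
        }
        session.rebased_onto.push(commit.clone());
    }
    Ok(())
}

fn main() {
    let mut session = Session { rebased_onto: vec![] };
    let commits = vec!["c1".to_string(), "c2".to_string(), "c3".to_string()];
    // Pretend the conflict introduced by "c3" cannot be solved by the current policy.
    let result = rebase(&mut session, &commits, |c| c != "c3");
    assert!(result.is_err());
    assert_eq!(session.rebased_onto, vec!["c1", "c2"]); // partial progress is kept
}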

Contributor Author (paraseba):

done

pub async fn rebase(
&mut self,
solver: &dyn ConflictSolver,
update_branch_name: &str,
@rabernat (Contributor), Nov 23, 2024:

The default should be to update against the repo's current active branch. I can see always having to specify this explicitly as a potential pain point for users. 95% of the time there will just be a single branch active in a repo.

Edit: I guess this is just a general feature of the Rust API, so probably not a big deal.

Contributor Author (paraseba):

Yeah, this is the low-level API, and there is no notion of a current branch at this level, but we will definitely make it the default (or the only option) in the Python bindings.

repo2.rebase(&ConflictDetector, "main").await,
);
Ok(())
}
Contributor:

These tests are very clear and easy to follow.

repo2
.set_user_attributes(
path.clone(),
Some(UserAttributes::try_new(br#"{"foo":"bar"}"#).unwrap()),
Contributor:

Just noting that this conflict should actually be resolvable since the changes are the same! 😆 But I understand of course that sort of resolution is not implemented yet.

Contributor Author (paraseba):

Yes!!! This is a somewhat more advanced feature that I want us to have eventually. I wrote the test this way so that we get a test failure once we implement that.

}

#[tokio::test()]
async fn test_rebase_without_fast_forward() -> Result<(), Box<dyn Error>> {
Contributor:

It was hard for me to understand what this is testing just by reading the code. Consider adding some comments to this and the next test.

Contributor Author (paraseba):

Great, I'll add a brief comment to every test.

Contributor Author (paraseba):

done


Ok(())
}

Contributor:

General comment: coming from python, pytest, and the wonderful world of test parametrization, these tests feel very verbose and somewhat repetitive. Not sure if there is any remedy for that.

Contributor Author (paraseba):

Yeah, it's painful... people tend to use macros. But in general our tests are bad: too verbose, with no helper functions. I want us to refactor all of them into something much more readable and easier to write. We have #380 to track that work.

pub deleted_arrays: HashSet<NodeId>,
pub updated_user_attributes: HashSet<NodeId>,
pub updated_zarr_metadata: HashSet<NodeId>,
pub updated_chunks: HashMap<NodeId, HashSet<ChunkIndices>>,
Contributor:

Does updated chunks include deleted chunks?

More generally, how do we handle deleted chunks? Do we treat them the same as any other chunk write?

Contributor Author (paraseba):

Yes, it's included there. For conflict detection today we treat a delete as any other chunk write, which is not great: if both sessions delete the same chunk, it shouldn't be a conflict. But that's also easy to work around with the default UseOurs policy. We can revisit later unless you have an important use case I haven't thought about.
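A tiny illustration of the current behaviour described above, with simplified types rather than the real detector: because a chunk delete is recorded in the same per-node set as a chunk write, two sessions that delete the same chunk index still intersect and get flagged.

use std::collections::HashSet;

// Simplified stand-in for ChunkIndices; illustration only.
type ChunkIndex = Vec<u32>;

// Chunk indices touched by both sessions; a delete counts just like a write.
fn conflicting_chunks(ours: &HashSet<ChunkIndex>, theirs: &HashSet<ChunkIndex>) -> Vec<ChunkIndex> {
    ours.intersection(theirs).cloned().collect()
}

fn main() {
    let ours: HashSet<ChunkIndex> = HashSet::from([vec![0, 0]]); // we deleted chunk (0, 0)
    let theirs: HashSet<ChunkIndex> = HashSet::from([vec![0, 0]]); // the other session did too
    let expected: Vec<ChunkIndex> = vec![vec![0, 0]];
    // Flagged as a conflict even though both sides did the same thing.
    assert_eq!(conflicting_chunks(&ours, &theirs), expected);
}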

@paraseba (Contributor Author):

Quoting the review comment above: "I'm curious whether it would make sense to just serialize the full changeset as the transaction log."

It's definitely a valid point in the design space. This was my reasoning for choosing the current style:

  • Most of the time you don't need or want to recover the session; it would make things slower to save you once every 10k times.
  • When you do, you want protection against machine failure during the whole session, not just during the brief time the commit lasts. If your machine dies during the 8 hours your writes take, you lose everything.
  • For that, we would need real session checkpoints, and those really look like storing the change set. But they need to be cleaned up, we need to decide on a checkpoint frequency, etc. So it felt like a very interesting feature to add later. Also, it's a feature that sounds very connected to "change sets that don't fit in memory", which is also something we should implement eventually.
  • The transaction logs have less information than the change sets, but in a more useful form, which simplifies the rebase (and hopefully the diff) functionality. For example, in the tx log I'm storing the node ids of all deleted nodes, as opposed to the change set, where we have only the paths.

I wish I had written a design doc...
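A simplified sketch of the contrast drawn in the last bullet, with illustrative field names loosely modeled on the struct fragments quoted in this review (not the actual on-disk format): the change set tracks deletions by path and carries the new data itself, while the transaction log records the stable NodeId of every touched node, which is what rebase wants to compare.

use std::collections::HashSet;

// Illustrative stand-ins; the real icechunk types carry much more information.
type NodeId = u128;
type Path = String;

// The in-memory change set knows what the session did, keyed mostly by path,
// and holds the actual new metadata, attributes and chunk payloads.
struct ChangeSetSketch {
    deleted_arrays: HashSet<Path>,
    // ...
}

// The transaction log keeps only what conflict detection needs, keyed by NodeId,
// so comparing two commits doesn't depend on how paths were reused or renamed.
struct TransactionLogSketch {
    deleted_arrays: HashSet<NodeId>,
    updated_user_attributes: HashSet<NodeId>,
    updated_zarr_metadata: HashSet<NodeId>,
}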

@mpiannucci (Contributor) left a comment:

This is impressive, and the tests are really easy to follow.

pub struct TransactionLog {
// FIXME: better, more stable on-disk format
pub icechunk_transaction_log_format_version: IcechunkFormatVersion,
pub new_groups: HashSet<NodeId>,
@mpiannucci (Contributor), Nov 23, 2024:

So, if I understand correctly, we keep a NodeId for the changes, which the Repository can then (in the future) resolve into a more readable or custom format by resolving the zarr path from the ID in the given manifest, right? Does the NodeId refer to the node's id in the incoming changeset or in the existing changeset?

Contributor Author (paraseba):

Yes, that is correct. NodeIds are slightly better than paths, because they are random, so something like create /foo -> delete /foo -> create /foo doesn't get tricky.

This is the NodeId assigned in the change that generated this transaction, so in your description, the existing changeset.
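A small sketch of why per-creation ids avoid the create /foo -> delete /foo -> create /foo ambiguity (hypothetical id generation, a counter instead of icechunk's random NodeIds): the two arrays that ever lived at /foo get different ids, so changes recorded against the first cannot be confused with changes to the second.

use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for a random NodeId: here just a fresh counter value per created node.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u64);

fn new_node_id() -> NodeId {
    NodeId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

fn main() {
    // create /foo
    let first_foo = new_node_id();
    // delete /foo, then create /foo again: same path...
    let second_foo = new_node_id();
    // ...but different ids, so logs keyed by NodeId keep the two apart.
    assert_ne!(first_foo, second_foo);
}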

@paraseba merged commit 459296e into main on Nov 24, 2024
3 checks passed
@paraseba deleted the push-kmsppxuloqls branch on November 24, 2024 at 02:35